arXiv:1509.00061v1 [cs.LG] 31 Aug 2015


Value Function Approximation via Low Rank Models 


Hao Yi Ong


Abstract — We propose a novel value function approximation 
technique for Markov decision processes. We consider the problem
of compactly representing the state-action value function
using a low-rank and sparse matrix model. The problem is to
decompose a matrix that encodes the true value function into
low-rank and sparse components, and we achieve this using
Robust Principal Component Analysis (PCA). Under minimal
assumptions, this Robust PCA problem can be solved exactly
via the Principal Component Pursuit convex optimization problem.
We test the procedure on several examples and
demonstrate that our method yields approximations essentially
identical to the true function.

I. Introduction 

One way to solve Markov decision processes (MDPs) is 
to compute the state-action value function from which the 
optimal policy can be extracted. The value function can 
be represented as a matrix where each entry corresponds 
to the value for a state-action pair. For practical MDPs, 
data encoding the value function routinely lie in millions 
or even billions of dimensions. The ability to accurately
represent this data in a compact basis would presumably
have an impact on a wide range of disciplines that rely on
stochastic decision making, including robotics, automated
control, economics, and manufacturing.

We consider the problem of compactly approximating an 
MDP state-action value function. To alleviate the curse of
dimensionality and scale, we must leverage the fact that
the value functions have low intrinsic dimensionality. That
is, they lie on some low-dimensional subspace [1], are sparse 
in some basis [2], or lie on some low-dimensional manifold 
[3], [4]. The foundation of our approach is similar to that 
in value function approximation in reinforcement learning 
(RL). In RL, researchers have employed a wide variety of 
basis function schemes to approximate value functions, most 
commonly radial basis functions and CMACs [5]. Implicit 
in these methods is the assumption that the value functions 
can be captured accurately by a small set of features; i.e., an 
intrinsic assumption about low-dimensionality. 

In our problem, we decompose a data matrix formed by 
the state-action values as a low-rank part plus a residual, 
which is not necessarily sparse (as we would like). The 
data matrix is thus modeled as a superposition of a low-rank
component and a sparse component. This is posed as a
matrix decomposition problem under the broader framework
of Robust Principal Component Analysis (PCA). In general,
accurate decomposition of a matrix is impossible; but the
knowledge that the matrix has low rank radically changes 
this premise, making the search for solutions meaningful [6], 
[7]. As far as the author knows, this is a novel application 
of Robust PCA to value function approximation. 

The rest of this paper is organized as follows. Section II 
introduces the mathematical formulation of our problem, 
followed by Section III, which presents an approach for 
Robust PCA. Section IV validates our approach through 
several numerical experiments, and Section V concludes with 
a few remarks on future work. 

II. Value Function Decomposition 

We begin with a brief review of MDPs followed by an
abstract definition of the decomposition problem.

A. Markov decision process 

In an MDP, an agent chooses action $a_t$ at time $t$ after
observing state $s_t$. The agent then receives reward $r_t$, and
the state evolves probabilistically based on the current state-action
pair. The explicit assumption that the next state only
depends on the current state-action pair is referred to as the
Markov assumption. An MDP can be defined by the tuple
$(S, A, T, R)$, where $S$ and $A$ are the sets of all possible
states and actions, respectively, $T$ is a probabilistic transition
function, and $R$ is a reward function. $T$ gives the probability
of transitioning into state $s'$ from taking action $a$ at the
current state $s$, and is often denoted $T(s,a,s')$. $R$ gives a
scalar value indicating the immediate reward received for
taking action $a$ at the current state $s$ and is denoted $R(s,a)$.

To solve an MDP, we compute a policy $\pi^*$ that, if
followed, maximizes the expected sum of immediate rewards
from any given state. The optimal policy is related to the
optimal state-action value function $Q^*(s,a)$, which is the
expected value when starting in state $s$, taking action $a$,
and then following actions dictated by $\pi^*$. Mathematically,
it obeys the Bellman recursion

$$Q^*(s,a) = R(s,a) + \sum_{s' \in S} T(s,a,s') \max_{a' \in A} Q^*(s',a').$$

The state-action value function can be computed using a 
dynamic programming algorithm called value iteration. To 
obtain the optimal policy for state s, we compute 

$$\pi^*(s) = \operatorname*{argmax}_{a \in A} Q^*(s,a).$$
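The recursion and the greedy policy extraction above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's Julia implementation: the nested-list layout `T[s][a][s2]`, the function names, and the `gamma` discount parameter are assumptions made for this sketch (`gamma = 1.0` reproduces the undiscounted recursion as written in the text).

```python
def value_iteration(T, R, gamma=1.0, tol=1e-9, max_iter=10_000):
    """Iterate the Bellman recursion until Q stops changing.

    T[s][a][s2] is the transition probability and R[s][a] the reward.
    gamma is a hypothetical discount factor (1.0 matches the text).
    """
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iter):
        V = [max(row) for row in Q]  # max over a' of Q(s', a')
        Q_new = [
            [R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_states))
             for a in range(n_actions)]
            for s in range(n_states)
        ]
        delta = max(abs(Q_new[s][a] - Q[s][a])
                    for s in range(n_states) for a in range(n_actions))
        Q = Q_new
        if delta < tol:
            break
    return Q


def greedy_policy(Q):
    """Return argmax_a Q(s, a) for every state s (first maximizer on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]
```

With `gamma = 1.0` the iteration only settles when the MDP has an absorbing, zero-reward state that every policy eventually reaches or prefers, so a discount below one is the safer choice for general problems.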



B. Matrix decomposition 

Suppose matrix $M \in \mathbb{R}^{m \times n}$ encodes the state-action values
of an MDP, where $m$ and $n$ are the cardinalities of the
state and action spaces, respectively. Intuitively, this scheme
leverages the correlation between action values close to each
other. We approximate $M$ via the decomposition

$$M = L_0 + S_0,$$

where $L_0$ has low rank and $S_0$ is sparse; here, both components
are of arbitrary magnitude. We have no knowledge
of the low-dimensional column and row space of $L_0$, or its
dimensionality. Similarly, we do not know the locations and
number of the nonzero entries of $S_0$. We wish to obtain
$L_0$ and $S_0$, a low-rank plus sparse approximation of the true
state-action value function. This is achieved via Robust PCA.

C. Robust PCA 

Classical PCA [1], [8], [9] seeks the best (in an $\ell_2$ sense)
rank-$k$ estimate of $L_0$ by solving

$$\begin{array}{ll} \text{minimize} & \|M - L\|_F \\ \text{subject to} & \operatorname{rank}(L) \le k, \end{array}$$

with variable $L$. Here, $\|\cdot\|_F$ denotes the Frobenius norm of
a matrix, i.e., the square root of the sum of the squares of
the entries. This problem can be efficiently solved via the
singular value decomposition (SVD) and enjoys a number
of optimality properties when the noise is small and i.i.d.
Gaussian.
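The SVD solution of the classical PCA problem can be sketched as follows: keep the $k$ largest singular values and discard the rest (the Eckart-Young result). The function name `best_rank_k` is an assumption of this sketch, which relies on NumPy rather than the MATLAB and Julia tooling used in the paper.

```python
import numpy as np


def best_rank_k(M, k):
    """Best rank-k approximation of M in the Frobenius norm via truncated SVD."""
    U, s, Vt = np.linalg.svd(np.asarray(M, dtype=float), full_matrices=False)
    # Scale the first k left singular vectors by their singular values,
    # then project back with the first k right singular vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```

The approximation error equals the root of the sum of the squared discarded singular values, which is why a single large corrupted entry can distort the estimate and why the robust variant is needed.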

While Robust PCA shares the same problem definition
(1) as classical PCA, it does not have the same simplifying
assumptions about the noise. Unlike the small noise in
classical PCA, the entries in $S_0$ can have arbitrarily large
magnitude, and their support is assumed to be sparse but
unknown.

III. Approach 

This section demonstrates how to cast the matrix decomposition
problem as Principal Component Pursuit (PCP) and
discusses our choice of algorithm to solve it.

A. Principal Component Pursuit 

We obtain the value function components through the PCP
estimate

$$\begin{array}{ll} \text{minimize} & \|L\|_* + \lambda \|S\|_1 \\ \text{subject to} & L + S = M, \end{array} \tag{2}$$

which can be solved by tractable convex optimization. Here,
$\|L\|_* = \sum_i \sigma_i(L)$ is the nuclear norm of a matrix; i.e., the sum
of its singular values. $\|\cdot\|_1$ denotes the $\ell_1$-norm of a matrix
seen as a long vector in $\mathbb{R}^{m \times n}$. Assuming that the low-rank
component $L_0$ is not sparse and that the sparsity pattern of
the sparse component $S_0$ is selected uniformly at random, the
simple PCP solution perfectly recovers the low-rank and the
sparse components [12]. In particular, all that PCP requires
about $L_0$ is that its singular vectors are not spiky; i.e., the
$\ell_\infty$-norm of any singular vector is not too large. To avoid
any ambiguity, our model for $S_0$ is this: take an arbitrary
matrix $S$ and set to zero its entries on a random set; this
gives $S_0$.

B. Identifiability of low-rank and sparse components 

To make the problem of matrix decomposition meaningful,
we impose that the low-rank component $L_0$ is not sparse. We
consider the general notion of incoherence introduced in [13]
for the matrix completion problem; this assumption concerns
the singular vectors of the low-rank component. We write the
singular value decomposition of $L_0 \in \mathbb{R}^{m \times n}$ as

$$L_0 = U \Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T,$$

where $r$ is the rank of the matrix, $\sigma_1, \ldots, \sigma_r$ are the positive
singular values, and $U = [u_1, \ldots, u_r]$, $V = [v_1, \ldots, v_r]$ are
the matrices of the left- and right-singular vectors. The
incoherence condition with parameter $\mu$ states that

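$$\max_i \|U^T e_i\|_2^2 \le \frac{\mu r}{m}, \quad \max_i \|V^T e_i\|_2^2 \le \frac{\mu r}{n}, \tag{3}$$

and

$$\|U V^T\|_\infty \le \sqrt{\frac{\mu r}{mn}}. \tag{4}$$

Above, $e_i$ is the $i$-th standard basis vector and we define
$\|X\|_\infty = \max_{i,j} |X_{ij}|$; i.e., the $\ell_\infty$-norm of a matrix seen as
a long vector. As discussed in [6], [13], [14], the incoherence
condition asserts that for small values of $\mu$, the singular
vectors are reasonably spread out; i.e., not sparse.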
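To make the incoherence condition concrete, the sketch below computes the smallest parameter $\mu$ for which a given matrix satisfies both (3) and (4). The function name `incoherence` and the rank tolerance `tol` are assumptions of this illustration; NumPy is used for the SVD.

```python
import numpy as np


def incoherence(L0, tol=1e-10):
    """Smallest mu such that L0 satisfies incoherence conditions (3) and (4)."""
    L0 = np.asarray(L0, dtype=float)
    m, n = L0.shape
    U, s, Vt = np.linalg.svd(L0, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))          # numerical rank
    U, V = U[:, :r], Vt[:r, :].T
    # ||U^T e_i||^2 is the squared norm of row i of U (same for V).
    mu_U = (m / r) * np.max(np.sum(U ** 2, axis=1))
    mu_V = (n / r) * np.max(np.sum(V ** 2, axis=1))
    # Condition (4) rearranged: mu >= (m n / r) * ||U V^T||_inf^2.
    mu_UV = (m * n / r) * np.max(np.abs(U @ V.T)) ** 2
    return max(mu_U, mu_V, mu_UV)
```

A constant matrix, whose singular vectors are perfectly spread, attains the smallest value $\mu = 1$, whereas a matrix with a single nonzero entry is maximally spiky and yields $\mu = mn$.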

Another issue arises if the sparse matrix has low rank. This
will occur if, say, all the nonzero entries of $S_0$ occur in a
column or in a few columns. Consider the case where the
first column of $S_0$ is the opposite of that of $L_0$, and where all
the other columns of $S_0$ vanish. Then it is clear that we would
not be able to recover $L_0$ and $S_0$ since $M = L_0 + S_0$ would
have a column space equal to or included in that of $L_0$. To
avoid such situations, we will assume that the sparsity pattern
of the sparse component is selected uniformly at random.

C. Perfect recovery via PCP 

Surprisingly, the simple PCP solution perfectly recovers
the low-rank and sparse components under the minimal assumptions
above. Of course, we also require that the rank of
the low-rank component is not too large, and that the sparse
component is reasonably sparse. Below, $n_1 = \max\{m,n\}$ and
$n_2 = \min\{m,n\}$.

Theorem 1. Suppose $L_0$ is $m \times n$, obeys (3) and (4), and
that the support set of $S_0$ is uniformly distributed among
all sets of cardinality $z$. Then there is a numerical constant
$c$ such that with probability at least $1 - c n_1^{-10}$ (over the
choice of support of $S_0$), Principal Component Pursuit (2)
with $\lambda = 1/\sqrt{n_1}$ is exact; i.e., $\hat{L} = L_0$ and $\hat{S} = S_0$, provided
that

$$\operatorname{rank}(L_0) \le \rho_r n_2 \mu^{-1} (\log n_1)^{-2} \quad \text{and} \quad z \le \rho_z m n.$$

Above, $\rho_r$ and $\rho_z$ are positive numerical constants.

In other words, matrices $L_0$ whose principal components
are reasonably spread can be recovered with probability
nearly one from arbitrary and completely unknown corruption
patterns as long as these are randomly distributed. In
fact, this works for large values of the rank; i.e., on the order
of $n_2 / (\log n_1)^2$ when $\mu$ is not too large.

Another remarkable property is that there is no tuning
parameter in this algorithm. Under the assumption of Theorem 1,
minimizing

$$\|L\|_* + \frac{1}{\sqrt{\max\{m,n\}}} \|S\|_1$$

always returns the correct answer; i.e., in Eq. (2) choose

$$\lambda = \frac{1}{\sqrt{\max\{m,n\}}}.$$

In fact, the proof of the theorem in [12] gives a whole range
of correct $\lambda$ values, and this is a sufficiently simple value in
that range.
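As a quick worked instance of this choice, assume the state grid used in the experiments (50 by 50 states, 1000 actions) is flattened so that $M$ has $m = 2500$ rows and $n = 1000$ columns. The flattening order is an assumption here; only the dimensions matter for $\lambda$.

```latex
\lambda = \frac{1}{\sqrt{\max\{2500,\, 1000\}}}
        = \frac{1}{\sqrt{2500}}
        = \frac{1}{50}
        = 0.02
```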

D. Algorithm 

For small problem sizes, say $\max\{m,n\} < 100$, PCP
can be performed using off-the-shelf tools such as interior
point methods [15]. This was suggested for low-rank and
sparse decomposition in [14]. However, despite their superior
convergence rates, interior point methods are limited by the
$O(m^6)$ complexity of computing a step direction. More sophisticated
methods with better complexity and convergence
rates include iterative thresholding methods using continuation
techniques [16], [17], Bregman iterations [18], and
Nesterov's optimal first-order algorithm for smooth and nonsmooth
minimization [19]-[21]. An Accelerated Proximal
Gradient (APG) algorithm was suggested for low-rank and
sparse decomposition in [22]. APG inherits the optimal
$O(1/k^2)$ convergence rate for this class of problems, and
empirical evidence suggests that it can solve the convex PCP
problem at least 50 times faster than straightforward iterative
thresholding.

Despite its good convergence guarantees, however, APG
does not in practice achieve good accuracy and convergence
across a wide variety of problem settings [23]. In this paper,
we instead solve the convex PCP problem Eq. (2) using an
augmented Lagrange multiplier (ALM) algorithm introduced
in [23], [24]. As reported in [12], ALM achieves much higher
accuracy than APG, in fewer iterations, and works stably
across a wide range of problem settings with no parameter
tuning.

The ALM method operates on the augmented Lagrangian

$$\mathcal{L}(L,S,Y) = \|L\|_* + \lambda \|S\|_1 + \langle Y, M - L - S \rangle + \frac{\mu}{2} \|M - L - S\|_F^2,$$

where $\langle \cdot, \cdot \rangle$ denotes the standard trace inner product. A generic
Lagrange multiplier algorithm would solve PCP by iteratively
computing $(L_k, S_k) := \operatorname*{argmin}_{L,S} \mathcal{L}(L,S,Y_k)$
and then updating the Lagrange multiplier matrix via
$Y_{k+1} := Y_k + \mu (M - L_k - S_k)$ [25].

For our decomposition problem, we avoid solving
a sequence of convex programs by recognizing that
$\min_L \mathcal{L}(L,S,Y)$ and $\min_S \mathcal{L}(L,S,Y)$ both have very simple
and efficient solutions [12]. Let $\mathcal{S}_\tau : \mathbb{R} \to \mathbb{R}$ denote
the shrinkage operator $\mathcal{S}_\tau[x] = \operatorname{sgn}(x) \max(|x| - \tau, 0)$, and
extend it to matrices by applying it to each element. It can
be shown that

$$\operatorname*{argmin}_S \mathcal{L}(L,S,Y) = \mathcal{S}_{\lambda/\mu}(M - L + \mu^{-1} Y).$$

Similarly, for matrices $X$, let $\mathcal{D}_\tau(X)$ denote the singular
value thresholding operator given by $\mathcal{D}_\tau(X) = U \mathcal{S}_\tau(\Sigma) V^T$,
where $X = U \Sigma V^T$ is any SVD. Again, we can show that

$$\operatorname*{argmin}_L \mathcal{L}(L,S,Y) = \mathcal{D}_{1/\mu}(M - S + \mu^{-1} Y).$$

Thus, a more efficient strategy is to first minimize $\mathcal{L}$ with
respect to $L$ (fixing $S$), then minimize $\mathcal{L}$ with respect to $S$ (fixing
$L$), and then finally update the Lagrange multiplier matrix $Y$
based on the residual $M - L - S$. Algorithm 1 summarizes
this strategy. We choose $\mu = mn / (4 \|M\|_1)$, as suggested in
[24], and terminate the algorithm when $\|M - L - S\|_F \le
\delta \|M\|_F$, with $\delta = 10^{-7}$.


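Algorithm 1 PCP by Alternating Directions [23], [24]

1: initialize: $S_0 = Y_0 = 0$, $\mu > 0$
2: while not converged do
3:     compute $L_{k+1} = \mathcal{D}_{1/\mu}(M - S_k + \mu^{-1} Y_k)$
4:     compute $S_{k+1} = \mathcal{S}_{\lambda/\mu}(M - L_{k+1} + \mu^{-1} Y_k)$
5:     compute $Y_{k+1} = Y_k + \mu (M - L_{k+1} - S_{k+1})$
6: output: $L$, $S$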
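The alternating strategy of Algorithm 1 can be sketched in NumPy as below. This is an illustrative reimplementation under stated assumptions, not the MATLAB/C code used for the experiments: the function names, the `max_iter` cap, and the default values for `lam`, `mu`, and `delta` (taken from the choices described in the text) are all part of the sketch.

```python
import numpy as np


def shrink(X, tau):
    """Elementwise shrinkage: sgn(x) * max(|x| - tau, 0)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svt(X, tau):
    """Singular value thresholding: shrink the singular values of X by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt


def pcp_alm(M, lam=None, mu=None, delta=1e-7, max_iter=1000):
    """Principal Component Pursuit by alternating directions (sketch)."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))        # lambda = 1 / sqrt(max{m, n})
    if mu is None:
        mu = m * n / (4.0 * np.abs(M).sum())  # mu = mn / (4 ||M||_1)
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    norm_M = np.linalg.norm(M, "fro")
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)     # minimize over L with S fixed
        S = shrink(M - L + Y / mu, lam / mu)  # minimize over S with L fixed
        residual = M - L - S
        Y = Y + mu * residual                 # multiplier update
        if np.linalg.norm(residual, "fro") <= delta * norm_M:
            break
    return L, S
```

Each iteration costs one SVD of an $m \times n$ matrix, which dominates the running time; production implementations typically use a partial SVD, a refinement omitted here for brevity.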


IV. Numerical Experiments 

To validate our approach, we apply it to two classical
stochastic problems: the mountain car and inverted pendulum.
The performance of our low-rank plus sparse models is
evaluated against the true state-action value functions.

A. Mountain car 

Following the problem definition in [5], the car starts from
the position-velocity pair $(x, \dot{x})$ and follows the dynamics

$$\begin{aligned} \dot{x} &:= \dot{x} + 0.001a - 0.0025\cos(3x) \\ x &:= x + \dot{x}, \end{aligned}$$

where $a \in [-1,1]$ is the acceleration input. The car can take
on the state values $(\dot{x}, x) \in [-0.07,0.07] \times [-1.2,0.6]$. To
incentivize getting to the top of the mountain at $x_0 = 0.5$,
the reward function is defined

$$R(x) = \begin{cases} 10, & x > x_0 \\ -1, & \text{otherwise.} \end{cases}$$
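A one-step simulator for these dynamics might look like the following sketch. Clipping the state to the stated bounds is an assumption of this illustration (the text gives only the admissible ranges, not the boundary handling), as are the constant and function names.

```python
import math

X_MIN, X_MAX = -1.2, 0.6     # position bounds
V_MIN, V_MAX = -0.07, 0.07   # velocity bounds
X_GOAL = 0.5                 # mountain top x_0


def clip(value, lo, hi):
    return max(lo, min(hi, value))


def step(x, v, a):
    """Advance position x and velocity v by one step under acceleration a."""
    v_next = clip(v + 0.001 * a - 0.0025 * math.cos(3.0 * x), V_MIN, V_MAX)
    x_next = clip(x + v_next, X_MIN, X_MAX)  # uses the updated velocity
    return x_next, v_next


def reward(x):
    """10 once the car is past the goal position, -1 otherwise."""
    return 10.0 if x > X_GOAL else -1.0
```

Starting from rest at the valley floor near $x = -0.5$, a single full-throttle step changes the velocity by less than $10^{-3}$, which is why the car must build momentum by swinging back and forth.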





B. Inverted pendulum 

In this problem, we are interested in balancing an inverted
pendulum in its unstable upright equilibrium position. The
system is described by the angle-angular speed tuple $(\theta, \dot{\theta})$
and its dynamics are

$$\begin{aligned} \theta &:= \theta + \dot{\theta}\, dt \\ \dot{\theta} &:= \dot{\theta} + (\sin\theta - \dot{\theta} + \tau)\, dt, \end{aligned}$$

where $dt = 0.3$ is the time period between decisions and
$\tau \in [-1,1]$ is the torque input. The state space is $(-\pi, \pi] \times
[-10,10]$. The reward function penalizes control effort while
favoring an upright pendulum angular position at 0°:

$$R(s,a) = \exp(\cos\theta - 1) - 0.1a^2.$$

C. Discretization 

The two test problems are continuous MDPs. To solve
them via value iteration, their state and action spaces are
discretized into fine grids, and their transitions are modeled
using the multilinear interpolation [26]

$$T(s,a,s') = \mathrm{Prob}(s' \mid s,a) = \sum_{s_c} \mathrm{Prob}(s' \mid s_c)\, \mathrm{Prob}(s_c \mid s,a),$$

where $s_c$ is the continuous state evolved from taking action $a$
at state $s$. Here, $\mathrm{Prob}(s' \mid s_c)$ is specified using multilinear
interpolation, and $\mathrm{Prob}(s_c \mid s,a)$ is the problem dynamics.
In our problems, $s_c$ is deterministically computed from the
state-action pair $(s,a)$, so $T$ reduces to

$$T(s,a,s') = \mathrm{Prob}(s' \mid s_c).$$

The discretization scheme is summarized in Table I.
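One way to realize the interpolation weights $\mathrm{Prob}(s' \mid s_c)$ is sketched below: a continuous point is spread over the corners of the grid cell that contains it, with weights that are products of per-axis linear weights. The function name, the dictionary return type, and clamping points to the grid boundary are assumptions of this sketch.

```python
from bisect import bisect_right
from itertools import product


def interp_weights(grids, point):
    """Map a continuous point to {grid-index tuple: probability}.

    grids is one sorted list of coordinates per axis (at least two each).
    """
    per_axis = []
    for grid, value in zip(grids, point):
        value = min(max(value, grid[0]), grid[-1])  # clamp (assumption)
        hi = min(max(bisect_right(grid, value), 1), len(grid) - 1)
        lo = hi - 1
        frac = (value - grid[lo]) / (grid[hi] - grid[lo])
        per_axis.append(((lo, 1.0 - frac), (hi, frac)))
    weights = {}
    for corner in product(*per_axis):
        index = tuple(i for i, _ in corner)
        w = 1.0
        for _, axis_weight in corner:
            w *= axis_weight
        if w > 0.0:
            weights[index] = weights.get(index, 0.0) + w
    return weights
```

The weights are nonnegative and sum to one, so they form a valid transition distribution over discrete states; a point that coincides with a grid node receives all of its mass on that node.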


TABLE I: Discretization scheme

Problem             Variable          Min.      Max.     No. of values
Mountain car        $\dot{x}$         -0.07     0.07     50
                    $x$               -1.2      0.6      50
                    $a$               -1        1        1000
Inverted pendulum   $\theta$          $-\pi$    $\pi$    50
                    $\dot{\theta}$    -10       10       50
                    $\tau$            -1        1        1000

With this scheme, we can extract the low-rank plus sparse
policy for some continuous state $s_c$ via

$$\pi(s_c) = \operatorname*{argmax}_{a \in A} \sum_{s} \mathrm{Prob}(s \mid s_c)\, Q(s,a).$$

Evaluating $Q(s,a)$ involves reading the row corresponding to
state $s$ in the matrix $M = L + S$; the greedy action is the
column with the maximum value in that row. The lookup can
be done efficiently in real time by taking the relevant matrix-vector
product for the singular vectors and values encoding
$L$ and adding that to the appropriate sparse row in $S$.
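The row lookup described above can be sketched without ever forming the full matrix. The storage layout is an assumption of this illustration: `U` and `V` hold the singular vectors row by row, `sigma` the singular values, and `S_rows` maps a state index to a dictionary of the nonzero entries in that row of $S$.

```python
def q_row(U, sigma, V, S_rows, s):
    """Row s of M = L + S, with L kept in factored form U diag(sigma) V^T."""
    rank = len(sigma)
    scaled = [U[s][k] * sigma[k] for k in range(rank)]
    # Matrix-vector product: one inner product per action (row of V).
    row = [sum(scaled[k] * V[a][k] for k in range(rank)) for a in range(len(V))]
    # Add the few nonzero sparse corrections stored for this state.
    for a, value in S_rows.get(s, {}).items():
        row[a] += value
    return row


def best_action(U, sigma, V, S_rows, s):
    """Index of the action with the largest value in row s."""
    row = q_row(U, sigma, V, S_rows, s)
    return max(range(len(row)), key=row.__getitem__)
```

A lookup costs on the order of $rn$ operations for rank $r$ and $n$ actions, plus the number of nonzeros in the sparse row, and the storage is the factors plus the sparse entries rather than all $mn$ values.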


D. Evaluation criteria 

To evaluate our controllers, we run 1,000,000 simulations 
on both MDPs and compare the performance of the optimal 
and low-rank plus sparse model policies. 

1) Mountain car: The evaluation metric for the controllers 
is how long it takes to reach the mountain top given a 
randomly and uniformly generated initial configuration. 

2) Inverted pendulum: The metric for the inverted pendulum
controllers is how well the controller can keep the
pendulum in the upright position. This metric is captured
by the average Euclidean distance between the pendulum 
angular position and the upright position. The initial states 
are drawn uniformly at random from the state-space. 

E. Implementation 

All computation was carried out on a system with a dual-core
Intel i7 processor, with clock speed 2.7 GHz and 8
GB of RAM, running Mac OS X. The ALM algorithm was 
implemented by the authors of [22] on a single thread in 
MATLAB and C, which interacted with MATLAB through 
a MEX interface [27]. The value iteration algorithm and 
simulation program were implemented in Julia [28]. All code 
can be found together with documentation integrated onto a 
Jupyter notebook at 

https://github.com/haoyio/LowRankMDP. 

F. Results 

Table II summarizes the results based on the evaluation 
criteria described above. Note that sparsity is defined as the 
fraction of zero elements over the total number of elements 
in the sparse component S. The results suggest that there 
is little if any difference between the optimal policy and 
the one produced by the low-rank and sparse model—they 
achieve virtually the same level of performance as their 
optimal counterparts. This is despite the low-rank models 
requiring less than 2% and 13% of the number of entries 
in the original matrices for the mountain car and inverted 
pendulum problems, respectively. 


TABLE II: Summary simulation results

Mountain car     time-to-goal    rank    sparsity    non-zero entries
optimal          54.461          1000    0           2.5 x 10^6
low-rank         54.461          11      0.009       4.887 x 10^4

Inv. pendulum    deviation       rank    sparsity    non-zero entries
optimal          0.441           1000    0           2.5 x 10^6
low-rank         0.442           50      0.075       3.136 x 10^5


Figure 1 shows the policy heat maps for the mountain
car and inverted pendulum MDPs. The color of any cell
indicates the numerical value of the best control input given
the state. There is a slight difference between the mountain
car policy heat maps on the top left and right corners of
the insets. On the other hand, visual inspection reveals little
to no difference between the inverted pendulum policy heat
maps. Intuitively, this comparison demonstrates why the
performance of the low-rank model was essentially identical
to that of the original.

[Figure 1: panels "Original policy" and "Low-rank + sparse policy" for each problem; horizontal axes: position x (mountain car) and angular position θ (rad) (inverted pendulum).]

(a) Mountain car policy heat maps; notice the slight difference in the top left and right corners of the heat maps.

(b) Inverted pendulum policy heat maps; there is barely any difference between the two.

Fig. 1: Visual comparison of policy heat maps for the true and low-rank plus sparse models reveals little difference.

V. Conclusion and Future Work

We have demonstrated a novel value function approximation
technique that exploits the intrinsic low dimensionality
of MDPs. State-action value functions of simple continuous
MDPs can be approximated virtually to perfection with far
lower memory requirements. It remains to experiment with a
wider variety of MDPs to determine if the approach generalizes
well. In the vein of applying Robust PCA to MDPs, an
interesting research direction will be to frame reinforcement
learning as a sequential noisy matrix completion problem.